fix a few bugs #270

Merged
merged 4 commits into from
Aug 5, 2016

Conversation

BurntSushi

This fixes both #264 and #268. It also fixes an unreported bug where the DFA cache size could grow far beyond the bound set by the caller.

Typically, when a DFA blows up in size, it happens for two reasons:

  1. It accumulates many states.
  2. Each state accumulates more and more NFA states.

Our previous approximation for the size of the DFA accounted for (1) but
used a constant for the size of (2). As a result, the approximate and
actual sizes of the DFA could differ by megabytes.

Since computing the actual size from scratch is expensive, we instead
maintain it as a running sum that is updated as states are added.
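
The running-sum idea can be sketched roughly as follows. This is a toy model, not the code in this PR: `State`, `Cache`, `state_size` and `add_state` are hypothetical names, and clearing the cache when the limit would be exceeded is an assumption made only to show the bound being enforced.

```rust
/// A toy DFA state: the set of NFA states it represents.
/// (Hypothetical layout, for illustration only.)
struct State {
    nfa_states: Vec<u32>,
}

/// A toy DFA cache that tracks its approximate heap usage incrementally.
struct Cache {
    states: Vec<State>,
    /// Running total of the approximate bytes used by `states`.
    approx_size: usize,
    /// Upper bound requested by the caller, in bytes.
    size_limit: usize,
}

impl Cache {
    fn new(size_limit: usize) -> Cache {
        Cache { states: Vec::new(), approx_size: 0, size_limit }
    }

    /// Approximate bytes for one state: a fixed overhead plus a cost that
    /// grows with the number of NFA states it holds (reason (2) above).
    fn state_size(state: &State) -> usize {
        std::mem::size_of::<State>()
            + state.nfa_states.len() * std::mem::size_of::<u32>()
    }

    /// Adds a state and updates the running total in O(1).
    /// Returns `true` if the cache had to be cleared first to stay within
    /// the limit (the clearing policy is an assumption of this sketch).
    fn add_state(&mut self, state: State) -> bool {
        let size = Cache::state_size(&state);
        let flushed =
            !self.states.is_empty() && self.approx_size + size > self.size_limit;
        if flushed {
            self.states.clear();
            self.approx_size = 0;
        }
        self.approx_size += size;
        self.states.push(state);
        flushed
    }
}
```

Because each state's cost is added once when it is inserted, the total never has to be recomputed by walking every state, yet it still reflects how many NFA states each DFA state carries.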

The end result is that we more strictly respect the memory limit set by the
caller.

The specific problem here is that our literal search doesn't know about
anchors, so it will try to search for all of the detected literals in a regex.
In a regex like `a|^b`, the literal `b` should only be searched for at the
beginning of the haystack and in no other place.
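
To make the failure concrete, here is a small stand-alone sketch of an anchor-unaware literal scan next to an anchor-aware one for `a|^b`. It does not use the regex crate's internals: `Lit`, `naive_find` and `anchored_find` are hypothetical names invented for this illustration.

```rust
/// A literal pulled out of a regex, with a flag recording whether the
/// regex only permits it at the start of the haystack (hypothetical type).
struct Lit {
    text: &'static str,
    anchored_start: bool,
}

/// The literals one would extract from `a|^b`.
fn literals_for_a_or_anchored_b() -> Vec<Lit> {
    vec![
        Lit { text: "a", anchored_start: false },
        Lit { text: "b", anchored_start: true },
    ]
}

/// Anchor-unaware search: leftmost occurrence of any literal.
/// This models the buggy behaviour described above.
fn naive_find(lits: &[Lit], haystack: &str) -> Option<usize> {
    lits.iter().filter_map(|l| haystack.find(l.text)).min()
}

/// Anchor-aware search: an anchored literal may only match at offset 0.
fn anchored_find(lits: &[Lit], haystack: &str) -> Option<usize> {
    lits.iter()
        .filter_map(|l| {
            if l.anchored_start {
                if haystack.starts_with(l.text) { Some(0) } else { None }
            } else {
                haystack.find(l.text)
            }
        })
        .min()
}
```

On the haystack `xb`, `naive_find` reports a match at offset 1 because it finds `b` there, while `anchored_find` reports no match, which is what `a|^b` requires.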

The right way to fix this is probably to make the literal detector smarter,
but the literal detector is already too complex. Instead, this commit
detects whether a regex is partially anchored (that is, when the regex has
at least one matchable sub-expression that is anchored), and if so,
disables the literal engine.
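
As a rough sketch of the check described above, the following walks a toy expression tree and reports whether any sub-expression is anchored. The `Expr` type and both function names are hypothetical and far simpler than the real syntax tree; only the start anchor is modelled.

```rust
/// A toy regex syntax tree (hypothetical, for illustration only).
enum Expr {
    Literal(String),
    StartAnchor,
    Concat(Vec<Expr>),
    Alternate(Vec<Expr>),
}

/// True if at least one sub-expression is anchored.
fn is_partially_anchored(e: &Expr) -> bool {
    match e {
        Expr::Literal(_) => false,
        Expr::StartAnchor => true,
        Expr::Concat(es) | Expr::Alternate(es) => {
            es.iter().any(is_partially_anchored)
        }
    }
}

/// The literal-only engine is used only when no anchor is present,
/// because its search ignores anchors.
fn can_use_literal_engine(e: &Expr) -> bool {
    !is_partially_anchored(e)
}
```

For the tree of `a|^b` the check returns `true`, so the literal engine is skipped; for `a|b` it returns `false` and the literal engine remains eligible.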

Note that this doesn't disable all literal optimizations, just the
optimization that bypasses the regex engines entirely. Both the DFA and the
NFA will still use literal prefixes to search. Namely, if a prefix search
finds a literal that needs to be anchored but isn't anchored in the
haystack, then the regex engine rules it out as a false positive.

Fixes rust-lang#268.

If the caller asks for captures, and the DFA runs, and there's a match,
and there are actually captures in the regex, then the haystack sent to
the NFA is shortened to correspond to only the match plus some room at the
end for matching zero-width assertions. This "room at the end" needs to be
big enough to at least fit a UTF-8 encoded Unicode codepoint.
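
The need for that much room can be illustrated with a small sketch. It assumes the slice keeps the start of the haystack and is only cut at the end; `shorten` and `char_at` are hypothetical helpers, not functions from this PR. The only hard fact relied on is that a UTF-8 encoded codepoint takes at most 4 bytes.

```rust
/// Longest UTF-8 encoding of a single Unicode codepoint, in bytes.
const MAX_UTF8_LEN: usize = 4;

/// Returns the prefix of `haystack` that ends `room` bytes after the match,
/// clamped to the haystack length (hypothetical helper).
fn shorten(haystack: &[u8], match_end: usize, room: usize) -> &[u8] {
    let end = std::cmp::min(haystack.len(), match_end.saturating_add(room));
    &haystack[..end]
}

/// Decodes the codepoint starting at byte offset `at`, if one is fully
/// present and valid. This stands in for what a zero-width assertion such
/// as a word boundary has to inspect just past the end of the match.
fn char_at(bytes: &[u8], at: usize) -> Option<char> {
    let tail = bytes.get(at..)?;
    let upto = std::cmp::min(tail.len(), MAX_UTF8_LEN);
    // Try each possible encoded length until one decodes cleanly.
    (1..=upto)
        .find_map(|len| std::str::from_utf8(&tail[..len]).ok())
        .and_then(|s| s.chars().next())
}
```

If the match `abc` in `abc€def` ends at byte 3, one byte of room cuts the three-byte `€` in half and `char_at` cannot decode it, whereas `MAX_UTF8_LEN` bytes of room always leaves the next codepoint intact.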

Fixes rust-lang#264.

Docopt uses lazy_static! 2.x, but lazy_static required a new minimum
Rust version in 2.1.

BurntSushi merged commit 11447f0 into rust-lang:master Aug 5, 2016
BurntSushi deleted the fixes branch August 5, 2016 01:06